This document explores Tesla’s stock prices from its initial public offering on June 29, 2010, to March 17, 2017. Tesla’s share price has risen dramatically over several years, and machine learning algorithms could reveal seasonal trends or autocorrelation in the data. Predictive analysis, including machine learning, can help investors manage the uncertainty of the market. This investigation assesses how well machine learning models can use historical time-series data to predict Tesla’s future market behavior. Machine learning techniques will be applied to the historical prices to evaluate the direction of market movement and to forecast the future value of the stock. The research questions are as follows:
The predictions will be compared to present-day stock values to determine the accuracy of the algorithms, and the factors that have affected the stock price to date (e.g. supply and demand, the economy, stock splits) will be reviewed.
library(quantmod)
library(dplyr)
library(lubridate)
library(summarytools)
library(corrplot)
library(tseries)
library(ggplot2)
library(plotly)
library(formattable)
library(dygraphs)
library(hrbrthemes)
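stock_list <- "TSLA"
# Quote the date: a bare 2010-06-29 is evaluated as arithmetic (2010 - 6 - 29 = 1975)
start_date <- as.Date("2010-06-29")
end_date <- Sys.Date()
tesla <- NULL
for (i in seq_along(stock_list)) {
  # Download daily prices from Yahoo Finance; getSymbols() assigns an object named after the ticker
  getSymbols(stock_list[i], verbose = FALSE, src = "yahoo",
             from = start_date, to = end_date)
  temp_df <- as.data.frame(get(stock_list[i]))
  # Move the dates from the row names into a regular column
  temp_df$Date <- row.names(temp_df)
  row.names(temp_df) <- NULL
  colnames(temp_df) <- c("Open", "High", "Low", "Close",
                         "Volume", "Adjusted", "Date")
  # Reorder so that Date is the first column
  temp_df <- temp_df[c("Date", "Open", "High",
                       "Low", "Close", "Volume", "Adjusted")]
  tesla <- temp_df
}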
The data analysis stage first consists of cleaning and inspecting the data for inconsistencies. Following these steps, the data may be transformed and modelled as required. As part of the data preparation stage, the following steps will be taken:
head(tesla)
## Date Open High Low Close Volume Adjusted
## 1 2010-06-29 3.800 5.000 3.508 4.778 93831500 4.778
## 2 2010-06-30 5.158 6.084 4.660 4.766 85935500 4.766
## 3 2010-07-01 5.000 5.184 4.054 4.392 41094000 4.392
## 4 2010-07-02 4.600 4.620 3.742 3.840 25699000 3.840
## 5 2010-07-06 4.000 4.000 3.166 3.222 34334500 3.222
## 6 2010-07-07 3.280 3.326 2.996 3.160 34608500 3.160
str(tesla)
## 'data.frame': 2680 obs. of 7 variables:
## $ Date : chr "2010-06-29" "2010-06-30" "2010-07-01" "2010-07-02" ...
## $ Open : num 3.8 5.16 5 4.6 4 ...
## $ High : num 5 6.08 5.18 4.62 4 ...
## $ Low : num 3.51 4.66 4.05 3.74 3.17 ...
## $ Close : num 4.78 4.77 4.39 3.84 3.22 ...
## $ Volume : num 93831500 85935500 41094000 25699000 34334500 ...
## $ Adjusted: num 4.78 4.77 4.39 3.84 3.22 ...
tesla$Date <- as.Date(tesla$Date)
class(tesla$Date)
## [1] "Date"
The ‘Date’ attribute was converted from a character vector to the Date type.
sum(is.na(tesla))
## [1] 0
There are no missing values. In light of the COVID-19 pandemic, the Tesla dataset will be filtered to include only the years 2010 to 2019. Due to the uncertainty surrounding the economy in 2020, I believe the data from 2020 onward would be dominated by noise.
# Keep 2010 to 2019 only; an upper bound also excludes any year after 2021
tesla <- tesla %>%
  filter(year(Date) <= 2019)
Next, a correlation plot will determine whether the attributes are correlated and to what degree.
x <- cor(tesla[2:7])
x
## Open High Low Close Volume Adjusted
## Open 1.0000000 0.9995588 0.9995524 0.9990113 0.4618287 0.9990113
## High 0.9995588 1.0000000 0.9994779 0.9996208 0.4710375 0.9996208
## Low 0.9995524 0.9994779 1.0000000 0.9995621 0.4526298 0.9995621
## Close 0.9990113 0.9996208 0.9995621 1.0000000 0.4624999 1.0000000
## Volume 0.4618287 0.4710375 0.4526298 0.4624999 1.0000000 0.4624999
## Adjusted 0.9990113 0.9996208 0.9995621 1.0000000 0.4624999 1.0000000
corrplot(x, type = "upper", order = "hclust")
To understand the attributes further, descriptive statistics (measures of central tendency and spread) will be reviewed.
Descriptive Statistics of Tesla Stock
| | Min | Q1 | Median | Mean | Q3 | Max | Std.Dev |
|---|---|---|---|---|---|---|---|
| Open | 3.23 | 6.85 | 42.42 | 36.62 | 52.80 | 87.00 | 22.89 |
| High | 3.33 | 6.96 | 43.19 | 37.26 | 53.55 | 87.06 | 23.24 |
| Low | 3.00 | 6.70 | 41.61 | 35.96 | 52.02 | 85.27 | 22.52 |
| Close | 3.16 | 6.87 | 42.32 | 36.63 | 52.78 | 86.19 | 22.90 |
| Volume (M) | 0.59 | 9.37 | 22.65 | 27.17 | 36.26 | 185.82 | 23.60 |
| Adjusted | 3.16 | 6.87 | 42.32 | 36.63 | 52.78 | 86.19 | 22.90 |
These values show the range of Tesla’s stock price over time. Notably, the gap between the minimum and maximum values is quite large, likely due to the upward trend over several years. Since this is a time series that starts at the initial public offering, and the value of the stock has changed drastically over several years, the extreme values most likely reflect genuine price movements rather than outliers.
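The code that produced the table above is not shown. As a minimal sketch of how such a summary could be computed in base R, the block below defines a hypothetical helper, `describe_columns`, and runs it on a small synthetic data frame (not Tesla prices) to show the shape of the result.

```r
# Hypothetical helper: one row of summary statistics per numeric column.
describe_columns <- function(df) {
  stats <- sapply(df, function(x) {
    q <- quantile(x, probs = c(0.25, 0.5, 0.75), names = FALSE)
    c(Min = min(x), Q1 = q[1], Median = q[2], Mean = mean(x),
      Q3 = q[3], Max = max(x), Std.Dev = sd(x))
  })
  # Transpose so that each column of df becomes a row of the summary
  round(t(stats), 2)
}

# Synthetic example data (not Tesla prices), only to show the output shape
demo <- data.frame(Open = c(1, 2, 3, 4, 5), Close = c(2, 4, 6, 8, 10))
describe_columns(demo)
```

On the report's data this would presumably be called as `describe_columns(tesla[2:7])`. Dividing `Volume` by 1e6 first would match the "Volume (M)" row, assuming that "(M)" stands for millions of shares.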
For the purposes of the forecasting investigation, the closing price (i.e. the value of the stock at the end of the trading day) will be used. The table below summarizes the closing prices by year from 2010 to 2019.
Tesla Closing Price Statistics
| Year | Min | Max | Average | % Change per Fiscal Year |
|---|---|---|---|---|
| 2010 | 3.16 | 7.09 | 4.67 | 11.47 |
| 2011 | 4.37 | 6.99 | 5.36 | 7.29 |
| 2012 | 4.56 | 7.60 | 6.23 | 20.62 |
| 2013 | 6.58 | 38.67 | 20.88 | 325.42 |
| 2014 | 27.87 | 57.21 | 44.67 | 48.17 |
| 2015 | 37.00 | 56.45 | 46.01 | 9.44 |
| 2016 | 28.73 | 53.08 | 41.95 | -4.35 |
| 2017 | 43.40 | 77.00 | 62.86 | 43.49 |
| 2018 | 50.11 | 75.91 | 63.46 | 3.83 |
| 2019 | 35.79 | 86.19 | 54.71 | 34.89 |
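A yearly summary like the one above could be built with dplyr and lubridate, both of which are loaded earlier. The sketch below is illustrative only: `yearly_close_stats` is a hypothetical helper, the example data is synthetic, and the percent change is assumed to be measured from the first to the last closing price within each calendar year.

```r
library(dplyr)
library(lubridate)

# Hypothetical helper: yearly min, max, mean and within-year percent change
# of the closing price. Expects a Date column named `Date` and a numeric `Close`.
yearly_close_stats <- function(df) {
  df %>%
    arrange(Date) %>%
    mutate(Year = year(Date)) %>%
    group_by(Year) %>%
    summarise(
      Min = min(Close),
      Max = max(Close),
      Average = mean(Close),
      # Assumed definition: change from the first to the last close of the year
      Pct_Change = (last(Close) - first(Close)) / first(Close) * 100,
      .groups = "drop"
    )
}

# Synthetic example (not real prices)
demo <- data.frame(
  Date = as.Date(c("2010-06-29", "2010-12-31", "2011-01-03", "2011-12-30")),
  Close = c(4, 5, 5, 6)
)
yearly_close_stats(demo)
```

With the report's data frame the call would be `yearly_close_stats(tesla)`.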
Next, visualizations will be used to observe trends in the data. To obtain accurate forecasts, several properties of a time-series dataset need to be considered. The following data exploration will determine:
The first two visualizations will display the closing price of Tesla stock per day.
Histogram of Closing Price
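The figures themselves are not reproduced here. As a rough sketch of how a daily line chart and a histogram of the closing price could be drawn with ggplot2 (loaded earlier), the block below uses a synthetic random walk in place of the `tesla` data frame. The chart types, titles and bin count are assumptions, not the original figures.

```r
library(ggplot2)

# Synthetic stand-in for the `tesla` data frame (a random walk, not real prices)
set.seed(1)
demo <- data.frame(
  Date = seq(as.Date("2010-06-29"), by = "day", length.out = 200),
  Close = 5 + cumsum(rnorm(200, sd = 0.1))
)

# Line chart of the daily closing price
p_line <- ggplot(demo, aes(x = Date, y = Close)) +
  geom_line() +
  labs(title = "Daily Closing Price", x = "Date", y = "Closing price (USD)")

# Histogram of the closing price
p_hist <- ggplot(demo, aes(x = Close)) +
  geom_histogram(bins = 30) +
  labs(title = "Histogram of Closing Price",
       x = "Closing price (USD)", y = "Count")

print(p_line)
print(p_hist)
```

Replacing `demo` with `tesla` would apply the same two plots to the downloaded data.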
Now that the closing stock prices have been visualized, it is important to determine whether autocorrelation is present in the data. Autocorrelation refers to the degree of linear similarity between the data and a lagged version of itself. In other words, this assessment determines whether previous observations influence the current ones. Along with this, it is important to test for partial autocorrelation, which measures the correlation at a given lag after the effects of shorter lags have been removed. From the visualizations above, it is not readily apparent whether the data is autocorrelated. The following tests will provide a concrete calculation and visualization of the presence of autocorrelation.
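One common way to check for autocorrelation and partial autocorrelation in R is with the base `acf()` and `pacf()` functions. The sketch below is not the original analysis: it runs on a synthetic random walk, and the choice of 30 lags is an assumption.

```r
# Synthetic random walk standing in for tesla$Close (not real prices)
set.seed(42)
close_demo <- 5 + cumsum(rnorm(500, sd = 0.1))

# Autocorrelation: correlation of the series with lagged copies of itself
acf_res <- acf(close_demo, lag.max = 30, plot = FALSE)
# Partial autocorrelation: correlation at lag k after removing shorter lags
pacf_res <- pacf(close_demo, lag.max = 30, plot = FALSE)

plot(acf_res, main = "ACF of closing price")
plot(pacf_res, main = "PACF of closing price")

# Approximate 95% significance bound drawn as dashed lines in the plots
bound <- qnorm(0.975) / sqrt(length(close_demo))
```

Bars that extend beyond the dashed bounds indicate lags at which the correlation is significantly different from zero. For the report's data, `close_demo` would be replaced by `tesla$Close`.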
Seasonality
Stationarity
# Augmented Dickey-Fuller (ADF) test.
# A p-value below 0.05 in adf.test() rejects the unit-root null hypothesis, indicating that the series is stationary.
adf.test(tesla$Close)
##
## Augmented Dickey-Fuller Test
##
## data: tesla$Close
## Dickey-Fuller = -2.7198, Lag order = 13, p-value = 0.2736
## alternative hypothesis: stationary
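The p-value of 0.2736 reported above is greater than 0.05, so the null hypothesis of a unit root is not rejected and the closing price series is treated as non-stationary. A common next step, which is not part of the output shown here, is to difference the series and repeat the test. The sketch below does this with log returns on a synthetic random walk; the use of a log transform is an assumption.

```r
library(tseries)

# Synthetic random walk standing in for tesla$Close (not real prices)
set.seed(123)
close_demo <- exp(cumsum(rnorm(1000, mean = 0.0005, sd = 0.02)))

# Log returns: first difference of the log price
log_returns <- diff(log(close_demo))

# Repeat the ADF test on the differenced series
adf_diff <- adf.test(log_returns)
adf_diff$p.value
```

If the p-value for the differenced series falls below 0.05, one order of differencing is usually considered sufficient, which corresponds to d = 1 in an ARIMA(p, d, q) model. For the report's data the input would be `tesla$Close`.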
The ARIMA model requires additional technical indicators to be calculated to strengthen the available information. These indicators include the following:
ARIMA model
K-Nearest Neighbours